Context dependent statistical augmentation of persian transcripts

نویسندگان

  • Panayiotis G. Georgiou
  • Shrikanth S. Narayanan
  • Hooman Shirani Mehr
چکیده

Persian language is transcribed in a lossy manner as it does not, as a rule, encode vowel information. This renders the use of the written script suboptimal for language models for speech applications or for statistical machine translation. It also causes the text-to-speech synthesis from a Persian script input to be a one-to-many operation. In our previous work, we introduced an augmented transcription scheme that eliminates the ambiguity present in the Arabic script. In this paper, we propose a method of generating the augmented transcription from the Arabic script by statistically decoding through all possibilities and choosing the maximum likelihood solution. We demonstrate that even with a small amount of initial bootstrap data, we can achieve a decoding precision of about 93% with no human intervention. The precision can be as high as 99.2% in a semi-automated mode where low confidence decisions are marked for human processing.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition

Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...

متن کامل

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

متن کامل

Persian-Spanish Low-Resource Statistical Machine Translation Through English as Pivot Language

This paper is an attempt to exclusively focus on investigating the pivot language technique in which a bridging language is utilized to increase the quality of the Persian–Spanish low-resource Statistical Machine Translation (SMT). In this case, English is used as the bridging language, and the Persian–English SMT is combined with the English–Spanish one, where the relatively large corpora of e...

متن کامل

MIZAN: A Large Persian-English Parallel Corpus

One of the most major and essential tasks in natural language processing is machine translation that is now highly dependent upon multilingual parallel corpora. Through this paper, we introduce the biggest Persian-English parallel corpus with more than one million sentence pairs collected from masterpieces of literature. We also present acquisition process and statistics of the corpus, and expe...

متن کامل

Cross-Lingual Word Sense Disambiguation for Languages with Scarce Resources

Word Sense Disambiguation has long been a central problem in computational linguistics. Word Sense Disambiguation is the ability to identify the meaning of words in context in a computational manner. Statistical and supervised approaches require a large amount of labeled resources as training datasets. In contradistinction to English, the Persian language has neither any semantically tagged cor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004